Abstract

This paper presents a method for future localization: to predict a set
of plausible trajectories of ego-motion given a depth image. We predict
paths avoiding obstacles, between objects, even paths turning around a
corner into space behind objects. As a byproduct of the predicted
trajectories of ego-motion, we discover in the image the empty space
occluded by foreground objects. We use no image-based features such as
semantic labeling/segmentation or object detection/recognition for this
algorithm. Inspired by proxemics, we represent the space around a person
using an EgoSpace map, akin to an illustrated tourist map, that measures
a likelihood of occlusion in the egocentric coordinate system. A future
trajectory of ego-motion is modeled by a linear combination of compact
trajectory bases, allowing us to constrain the predicted trajectory. We
learn the relationship between the EgoSpace map and trajectory from the
EgoMotion dataset, which provides in-situ measurements of the future
trajectory. A cost function that takes into account partial occlusion
due to foreground objects is minimized to predict a trajectory. This
cost function generates a trajectory that passes through the occluded
space, which allows us to discover the empty space behind the foreground
objects. We quantitatively evaluate our method to show predictive
validity and apply it to various real-world scenes including walking,
shopping, and social interactions.

1. Introduction 

Consider a dynamic scene such as Figure 1, where you, as the camera
wearer, plan to pass through the corridor in the shopping mall while
others walk in different directions. You need to plan your trajectory to
avoid collisions with others and with objects such as walls and fences.
Looking ahead, you would plan a trajectory that enters the shop by
turning left at the corner, although that space cannot be seen directly
from your perspective.

The fundamental problem we are interested in is future localization:
where am I supposed to be after 5, 10, and 15 seconds? This challenging
task requires understanding of the scene in terms of long-term temporal
human behaviors with respect to the spatial scene layout, with missing
data due to occlusions.

Figure 1. Where am I supposed to be after 5, 10, and 15 seconds? We
present a method to predict a set of plausible trajectories given a
first person depth image. As a byproduct of the predicted trajectories,
the space occluded by foreground objects, such as the space inside the
shop or behind the ladies, is discovered.

We study the future localization problem using a first person depth
(stereo) camera. We present a method to predict a set of plausible
trajectories of ego-motion given a depth image captured from an
egocentric view. As a byproduct of the predicted trajectories, the
occluded space behind foreground objects is discovered. Our method
relies purely on the depth measurements, i.e., no image-based features
such as semantic labeling/segmentation or object detection/recognition
are required.

Inspired by proxemics, we represent the space around a camera wearer
using an EgoSpace map, which resembles an illustrated tourist map: an
overhead map with objects seen from first person video projected onto
it.

A predictive future localization model, using the EgoSpace map, is
learned from in-situ first person stereo videos from various life
logging activities such as commutes, shopping, and social interactions.
By leveraging structure from motion, camera trajectories are
reconstructed. Each camera trajectory is associated with the depth image
at each time instant, i.e., given the depth image, a future camera
trajectory is precisely measured, while the depth image is obtained by
the stereo camera¹, as shown in Figure 2.

In a training phase, we discriminatively learn the relationship between
the EgoSpace map and the future camera trajectory. We model a trajectory
of ego-motion using a linear combination of compact trajectory bases.
Because ego-motion is aligned with gaze direction, the trajectory is
highly structured. We empirically show that 4~6 linear trajectory bases
are sufficient to express all plausible trajectories of ego-motion with
high precision (99% accuracy). This compact representation allows us to
efficiently find a set of trajectories that are compatible with the
associated depth image using EgoSpace map matching. This provides an
initialization of the predicted trajectories. However, not all of these
're-imagined' trajectories avoid objects in the current first person
view. We refine them by minimizing a cost function that takes into
account compatibility between the obstacles in the EgoSpace map and the
trajectory. This cost function explicitly models partial occlusion of a
trajectory, which allows us to discover the space behind foreground
objects.

¹ Any depth sensor such as Kinect or Creative Senz3D is complementary to
our depth measurement.


Why EgoSpace map? Two cues are strongly related to predicting a
trajectory of ego-motion, e.g., where is he or she going? (1) Ego-cue: a
vanishing point is often aligned with gaze direction, and the 2D visual
layout of the obstacles in the first person view implicitly encodes the
semantics of the scene. (2) Exo-cue: objects in a 3D scene such as
roads, buildings, and tables constrain the space where the wearer can
navigate. Such cues can be explicitly extracted from an ego-depth image,
where the gaze direction of the wearer can be calibrated with respect to
a ground plane (exocentric coordinate) while the depth provides
obstacles with respect to the wearer (egocentric coordinate). Our
EgoSpace map representation exploits these two cues: we measure depth
from an egocentric view and create an illustrated tourist map
representation capturing both the 2D visual arrangement of the obstacles
(in first person view) and their 3D layout (in overhead view). This
representation allows us to analyze and understand different scene types
and gaze directions in the same coordinate system.


Contributions To the best of our knowledge, this is the first paper that
predicts ego-motion from a depth image without semantic scene labels or
object detection via in-situ first person measurements. The core
technical contributions of our paper are (a) a predictive model that
describes a spatial distribution of objects with respect to an
egocentric view, allowing us to register different scenes in a unified
coordinate system; (b) a compact subspace representation of the
predicted trajectories that makes a search for trajectory parameters
feasible without explicit modeling of the dynamics of human behaviors;
(c) occluded space discovery through trajectory prediction; and (d) the
EgoMotion dataset with depth images and their long-term camera
trajectories, which includes diverse daily activities across camera
wearers. We evaluate our algorithm by predicting ego-motion in real
world scenes.


2. Related Work 


Our framework lies at the intersection of behavior prediction and
egocentric vision.

2.1. Human Behavior Prediction 

Predicting where-to-go is a long standing task in behavioral science.
This task requires understanding the interactions of agents with objects
in a scene that afford a space to move. There is a large body of
literature on human behavior prediction algorithms. Pentland and Lin
modeled human behaviors using a hidden Markov dynamic model to recognize
driving patterns. Such a Markovian model is an attractive choice to
encode human behaviors because it reflects the way humans make a
decision. These models, especially the partially observable Markov
decision process (POMDP), have influenced motion planning in robotics.

In computer vision, Ali and Shah developed a flow field model that
predicts spatial crowd behaviors for tracking extremely cluttered crowd
scenes. Inspired by the social force model, Mehran et al. predicted
pedestrian behaviors in a crowd scene to detect abnormal behaviors, and
Pellegrini et al. [27] used a modified model to track multiple agents.
Ryoo [32] presented a bag-of-words approach to recognize social
activities at the early stage of videos. Vu et al. predicted plausible
activities from a static scene by associating the scene statistics and
labeled actions. In terms of the trajectory prediction task, our work is
closely related to three path planning frameworks by Gong et al., Kitani
et al., and Alahi et al. Gong et al. presented a method to generate
multiple plausible trajectories of each agent in the scene constructed
by homotopy classes, which allows them to produce a long-term trajectory
for visual tracking in crowd scenes. Kitani et al. leveraged inverse
optimal control theory to learn human preference with respect to the
scene semantic labels, which enables them to predict the paths an agent
follows. Alahi et al. introduced a geometric feature, the social
affinity model, that captures a spatial relationship of neighboring
agents to predict destinations of a crowd.

Unlike previous methods that use semantic labels/segmentation or object
detection/tracking, which are often noisy in real world scenes, our
measurement is a single depth image that can be reliably obtained by
stereo cameras or depth sensors. Estimating optimal parameters for
Markovian models is often intractable. In contrast, our trajectory
representation in an egocentric view can be encoded using compact
trajectory bases, which makes learning tractable because of the reduced
number of parameters.

2.2. Egocentric Vision 

A first person camera is an ideal camera placement to observe human
activities because it reflects the attention of the camera wearer. This
characteristic provides a powerful cue to understand human behaviors.

Figure 2. Panels: (a) Ego-stereo cameras, (b) Geometry, (c) Depth image,
(d) EgoSpace map. (a) We use ego-stereo cameras to capture our dataset,
from which the depth image can be computed. Any depth sensor such as
Kinect is complementary to our stereo setup. (b) Inspired by proxemics,
we represent the space around a person using an EgoSpace map computed
from (c) the depth image. (d) The EgoSpace map, $\phi(r, \theta)$,
captures a likelihood of occlusion.

Kitani et al. used scene statistics produced by camera ego-motion to
recognize sport activities from a first person camera. Traditional
vision frameworks such as object detection, recognition, and
segmentation have been successfully integrated into first person data:
Pirsiavash and Ramanan recognized daily activities using deformable part
models, Lee et al. found important persons and objects, Fathi et al.
discovered objects, and Li et al. segmented pixels corresponding to
hands. In a social setting, Fathi et al. presented a method to recognize
social interactions by detecting gaze directions of people, and Park et
al. introduced an algorithm to reconstruct joint attention in 3D by
leveraging 3D reconstruction of camera ego-motion. This reconstruction
makes prediction of joint attention possible by learning the spatial
relationship between a social formation and joint attention.

Such characteristics of first person cameras were used to generate
interesting applications in vision, graphics, and robotics. Lee et al.
summarized a life logging video, and Xiong et al. detected iconic images
using a web image prior. Arev et al. used 3D joint attention to edit
social video footage, and Kopf et al. used 3D camera motion to generate
a hyperlapse first person video. In robotics, Ryoo et al. predicted
human activities for human-robot interactions.

Unlike most previous methods, our task primarily focuses on predicting
future behaviors by leveraging in-situ measurements from 3D
reconstruction of camera ego-motion. This also allows us to tackle a
more challenging problem: to discover an empty space that is not
observable because of visual occlusion.

3. Representation 

Inspired by proxemics, we present a characterization of space with
respect to the egocentric coordinate system, called the EgoSpace map.


3.1. EgoSpace Map 


The EgoSpace map is a representation of space experienced from a
first-person view but visualized in an overhead bird's-eye map, akin to
an illustrated tourist map.

It has three key ingredients. First, we define an egocentric coordinate
system centered at the feet location, the projection of the center of
eyes onto the ground plane, as shown in Figure 2(b). The normal
direction, $\mathbf{n}$, of the ground plane is aligned with the Y-axis,
and the height of the eye location is $h$, i.e.,
$\mathbf{c} = [\,0 \;\; h \;\; 0\,]^\mathsf{T}$, where $\mathbf{c}$ is
the 3D location of the center of eyes. The gaze direction defines the
tangential directions of the ground plane: the Z-axis is aligned with
the projection of the gaze direction, $\mathbf{v}$, i.e.,
$\mathbf{v} = [\,0 \;\; v_y \;\; v_z\,]^\mathsf{T}$.

Second, the EgoSpace map encodes the depth cue from a first person view
onto an overhead view on the ground plane. Using a log-polar
$(r, \theta)$ parametrization of the X-Z (ground) plane, we define the
EgoSpace map as a function
$\phi : \mathbb{R} \times \mathbb{S}^1 \to \mathbb{R}$, measuring the
likelihood of occlusion introduced by foreground objects from the gaze
direction. One can think of the eye gaze as a light source shining on
foreground objects, casting shadows onto the ground plane. On the shadow
image we record the object height, which is proportional to the
occlusion likelihood.

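Formally, $\phi(r, \theta)$ measures the height of the point,
$\mathbf{u}$, from the ground plane that intersects the ray,
$\mathbf{g}$, from the center of eyes, $\mathbf{c}$, to $(r, \theta)$
with an occluding object, $O$, i.e.,

$$\phi(r, \theta) = \mathbf{u}^\mathsf{T}\mathbf{n}, \qquad (1)$$

where $\mathbf{u} = \min_{\lambda \in \mathcal{L}} \lambda\mathbf{g} + \mathbf{c}$
such that
$\mathcal{L} = \{\lambda \mid \lambda\mathbf{g} + \mathbf{c} \in \{O_i\},\ \lambda > 0\}$,
and $\{O_i\}$ is the set of objects in the scene.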
We discretize the polar coordinate system by uniform sampling in angle
between $\pi/6$ and $5\pi/6$ and uniform sampling in the inverse of the
radius, which results in uniform sampling in the egocentric view as
shown in Figure 2(c). Note that the locations at which the EgoSpace map
is measured are almost radially uniform from the first person view
point. Figure 2(d) shows the EgoSpace map for Figure 2(c).
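The construction above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `egospace_map`, the grid sizes, and the radial limits `r_min` and `r_max` are assumptions, and the input is assumed to be a point cloud already expressed in the egocentric coordinate system. For a point at height $y$ on the ray from the eye to a ground location, the ray parameter is $\lambda = 1 - y/h$, so the first intersection (smallest $\lambda$) is the tallest point whose shadow falls on that location.

```python
import numpy as np

def egospace_map(points, eye_height, n_r=16, n_theta=24, r_min=0.5, r_max=20.0):
    """Rasterize 3D points (N, 3) into a polar occlusion map (hypothetical helper).

    Coordinates follow the text: origin at the feet, Y up, Z along the gaze.
    Each cell stores the height of the tallest point whose shadow, cast from
    the eye at (0, eye_height, 0), falls in that cell; 0 means visible ground.
    """
    eye = np.array([0.0, eye_height, 0.0])
    # Only points strictly between the ground and the eye cast a finite shadow.
    pts = points[(points[:, 1] > 0.0) & (points[:, 1] < eye_height)]
    # Extend the ray eye -> point until it reaches the ground plane (y = 0).
    t = eye_height / (eye_height - pts[:, 1])
    shadow = eye + t[:, None] * (pts - eye)
    r = np.hypot(shadow[:, 0], shadow[:, 2])
    theta = np.arctan2(shadow[:, 2], shadow[:, 0])  # the gaze (Z axis) is pi/2

    # Uniform in angle on [pi/6, 5pi/6] and uniform in inverse radius.
    theta_edges = np.linspace(np.pi / 6, 5 * np.pi / 6, n_theta + 1)
    inv_r_edges = np.linspace(1.0 / r_max, 1.0 / r_min, n_r + 1)

    ok = r > 0
    heights = pts[ok, 1]
    i_theta = np.digitize(theta[ok], theta_edges) - 1
    i_r = np.digitize(1.0 / r[ok], inv_r_edges) - 1
    valid = (i_theta >= 0) & (i_theta < n_theta) & (i_r >= 0) & (i_r < n_r)

    grid = np.zeros((n_r, n_theta))
    # Keep the maximum height per cell, i.e. the first hit along each ray.
    np.maximum.at(grid, (i_r[valid], i_theta[valid]), heights[valid])
    return grid
```

The masking of cells outside the depth image boundary is deliberately left out of this sketch.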
For future localization, the ground plane provides a free space for us
to move into. On the EgoSpace map, $\phi(r, \theta) = 0$ if, from the
first person view, the point $(r, \theta)$ lies on the ground plane.
More interestingly, the space behind an object also indicates potential
places to navigate. Since the EgoSpace map is represented in the ground
plane, not in the first person view, the space behind the object is
marked as an occluded area (the right few columns of the map).

Third, the area outside of the first person view depth image boundary is
set to $\phi_{\max} = 2$ m. On the EgoSpace map, the shape of the mask
is uniquely defined by the gaze direction (roll and pitch angles of the
head direction). For example, Figure 2(c) shows a case where the wearer
is looking ahead almost parallel to the ground: the ground area close to
the wearer ($r < 0.5$ m) is not visible, i.e.,
$\phi(r < 0.5\,\mathrm{m}, \theta)$ is marked as $\phi = \phi_{\max}$.
If the wearer is looking down, the masked area on the EgoSpace map would
be at large values of $r$.

The EgoSpace representation supports learning future localization from
first person videos by combining cues from 3D scene geometry and gaze
direction. Its benefits include: 1) the gaze direction normalized
coordinate system provides a common 3D reference frame to learn; 2) the
overhead view representation removes the variations in first person 3D
experience due to the head's pitch angle; 3) the log-polar encoding and
sampling gives more importance to nearby space; and 4) the depth masking
implicitly encodes both roll and pitch angle of the head, making it more
situation aware.

3.2. Compact Trajectory Representation 

Let $\mathbf{X} = [\,x_1 \;\; z_1 \;\; \cdots \;\; x_F \;\; z_F\,]^\mathsf{T} \in \mathbb{R}^{2F}$
be a trajectory on the ground plane of the egocentric coordinate system,
where $F$ is the number of future frames to predict and $x_i$ and $z_i$
are the two coordinates at the $i$-th time instant, as shown in
Figure 2(b). In practice, this trajectory can be obtained by projecting
the 3D camera poses between the $(f+1)$-th and $(f+F)$-th time instants
onto the ground plane at the $f$-th time instant. This allows us to
represent all trajectories in the same egocentric coordinate system,
normalized by gaze direction because the Z-axis is aligned with the gaze
direction.
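A minimal sketch of this gaze-normalized registration follows, assuming the camera positions have already been projected onto the ground plane. The function name `egocentric_trajectory` and its arguments are hypothetical and not part of the original method description.

```python
import numpy as np

def egocentric_trajectory(ground_xz, f, F, gaze_xz):
    """Express the next F ground positions in the egocentric frame at frame f.

    ground_xz: (T, 2) camera positions projected onto the ground plane.
    gaze_xz:   2D projection of the gaze direction at frame f (need not be unit).
    Returns X = [x_1, z_1, ..., x_F, z_F] with the Z axis along the gaze.
    """
    ground_xz = np.asarray(ground_xz, dtype=float)
    z_axis = np.asarray(gaze_xz, dtype=float)
    z_axis = z_axis / np.linalg.norm(z_axis)
    # Perpendicular axis chosen so that a gaze of (0, 1) gives an X axis of (1, 0).
    x_axis = np.array([z_axis[1], -z_axis[0]])
    rel = ground_xz[f + 1 : f + F + 1] - ground_xz[f]
    local = np.stack([rel @ x_axis, rel @ z_axis], axis=1)
    return local.reshape(-1)
```

For a wearer who walks straight along the gaze, every $x_i$ is zero and the $z_i$ grow with time, which is the redundancy the compact bases exploit.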

The gaze direction normalized trajectory is highly compressible. Most
trajectories of ego-motion can be encoded using a linear combination of
trajectory bases learned using Principal Component Analysis (PCA) from
the EgoMotion dataset described in Section 5:


$$\mathbf{X} = \mathbf{B}\boldsymbol{\beta} + \bar{\mathbf{X}}, \qquad (2)$$

where $\bar{\mathbf{X}}$ is a mean trajectory and
$\mathbf{B} \in \mathbb{R}^{2F \times K}$ is a collection of trajectory
bases, i.e., each column of $\mathbf{B}$ is a trajectory basis, where
$K$ is the number of bases. In practice, $K$ is selected as 4~6, which
can express all ego-motion trajectories with 99% accuracy, as shown in
Figure 3(a) and Figure 3(b). $\boldsymbol{\beta} \in \mathbb{R}^{K}$ is
the trajectory coefficient, which is the low dimensional parametrization
of $\mathbf{X}$. In Figure 3(b), we compare the reconstruction error
produced by PCA bases and generic DCT bases.
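As an illustration of Equation (2), the bases can be obtained from a matrix of registered training trajectories with a singular value decomposition. The helper names `fit_trajectory_bases`, `encode`, and `decode` below are assumptions made for this sketch.

```python
import numpy as np

def fit_trajectory_bases(trajectories, k):
    """Learn K PCA bases from an (M, 2F) matrix of registered trajectories."""
    mean = trajectories.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(trajectories - mean, full_matrices=False)
    bases = vt[:k].T  # shape (2F, K); columns are orthonormal trajectory bases
    return bases, mean

def encode(trajectory, bases, mean):
    """Project a trajectory onto the bases to get its coefficient beta."""
    return bases.T @ (trajectory - mean)

def decode(beta, bases, mean):
    """Reconstruct X = B beta + mean."""
    return bases @ beta + mean
```

Because the columns of the basis matrix are orthonormal, encoding is a plain projection and the reconstruction error depends only on the variance left outside the first $K$ directions.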



Figure 3. Panels: (a) Ego-motion trajectories (horizontal axis: X (m)),
(b) Reconstruction error (horizontal axis: number of bases). (a) We
register all trajectories in an egocentric coordinate system, which
results in highly redundant trajectories that can be represented by a
linear combination of (b) compact trajectory bases.


4. Prediction 

A trajectory of ego-motion is associated with an EgoSpace map, i.e.,
given a depth image, we know how we explored the space in the training
data (Section 5). By leveraging the computational representation of
egocentric space and trajectory described in Section 3, in this section
we present a method to predict a set of plausible trajectories given an
EgoSpace map and to discover the occluded space using the predicted
trajectories.

4.1. Ego-motion Prediction 

Estimating $\mathbf{X}$ that conforms to a depth image amounts to
finding a path that stays on the ground plane by minimizing the
following cost function along the trajectory:

$$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \sum_{i=1}^{F} \phi(\mathbf{B}_i\boldsymbol{\beta}), \qquad (3)$$

where $\phi : \mathbb{R}^2 \to \mathbb{R}$ is the Cartesian coordinate
representation of the EgoSpace map and
$\mathbf{B}_i \in \mathbb{R}^{2 \times K}$ is a matrix composed of the
$(2(i-1)+1)$-th and $2i$-th rows of $\mathbf{B}$. Therefore,
$\mathbf{B}_i\boldsymbol{\beta}$ is the point $(x_i, z_i)$ at the $i$-th
time instant.

Equation (3) finds a trajectory that stays on the ground given a depth
image. This approach has been used in robotics communities for various
path planning tasks. However, it does not account for a trajectory that
is partially occluded by objects, because the occluded part of the
trajectory always produces a higher cost. Instead, we introduce a novel
cost function that minimizes a trajectory cost difference between the
given depth image and the depth image retrieved from the database:

$$\underset{\boldsymbol{\beta}}{\text{minimize}} \;\; \sum_{i=1}^{F} \max\big(0,\ \phi(\mathbf{B}_i\boldsymbol{\beta}) - \phi_D(\mathbf{B}_i\boldsymbol{\beta}_D)\big), \qquad (4)$$

where $\phi_D$ and $\boldsymbol{\beta}_D$ are the EgoSpace map and
trajectory parameter retrieved from the training dataset. This
minimization finds a partially occluded trajectory as long as there
exists a trajectory in the database that has a similar occlusion cost.
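The difference between the plain ground cost and the occlusion-aware cost can be made concrete with a short sketch. It assumes the EgoSpace maps are available as callables that take an array of ground points and return one occlusion value per point, and it sums the hinge terms over the trajectory points; the function names are hypothetical and the mean trajectory is omitted for brevity.

```python
import numpy as np

def ground_cost(phi, bases, beta):
    """Sum of occlusion values along the trajectory B beta (plain planning cost)."""
    points = (bases @ beta).reshape(-1, 2)  # row i is the point (x_i, z_i)
    return float(np.sum(phi(points)))

def occlusion_aware_cost(phi, phi_d, bases, beta, beta_d):
    """Hinge on the cost difference to a retrieved (map, trajectory) pair.

    A point is only penalized when the test map is more occluded there than
    the retrieved map was along the retrieved trajectory.
    """
    points = (bases @ beta).reshape(-1, 2)
    points_d = (bases @ beta_d).reshape(-1, 2)
    return float(np.sum(np.maximum(0.0, phi(points) - phi_d(points_d))))
```

With this form, a trajectory that crosses an occluded region costs nothing extra if the retrieved training trajectory crossed a comparably occluded region, which is what lets predictions enter the space behind foreground objects.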

There exist an infinite number of trajectories that are compatible with
a given EgoSpace map. More importantly, the cost function in Equation
(4) is nonlinear, so the initialization of the solution is critical.

We initialize $\boldsymbol{\beta}$ using a trajectory retrieved from the
training data by EgoSpace map matching. The dataset is divided into 3
gaze directions (3 pitch angles) to reduce the false matches dominated
by the area beyond the depth image. Given an EgoSpace map, k-nearest
neighbors (KNN) are found using a K-d tree. Other search or planning
methods such as structured SVM and Rapidly Exploring Random Tree (RRT)
can be complementary to the KNN search.
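The retrieval step can be sketched as follows. Note that this sketch swaps in a brute-force Euclidean search for the K-d tree mentioned above, purely to stay self-contained; the function name and the array layout are assumptions.

```python
import numpy as np

def retrieve_initial_trajectories(query_map, train_maps, train_betas, k=3):
    """Return the trajectory coefficients of the k most similar training maps.

    query_map:   EgoSpace map of the test image, any shape.
    train_maps:  (M, ...) stack of training EgoSpace maps with matching shape.
    train_betas: (M, K) trajectory coefficients paired with the training maps.
    """
    flat_train = train_maps.reshape(len(train_maps), -1)
    dist = np.linalg.norm(flat_train - query_map.reshape(1, -1), axis=1)
    order = np.argsort(dist)[:k]  # indices of the k nearest maps
    return train_betas[order], dist[order]
```

Each returned coefficient vector serves as one starting point for the nonlinear refinement, so k also controls how many candidate trajectories are produced.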

4.2. Occluded Space Discovery 

The predicted trajectories of ego-motion allow us to discover the hidden
space occluded by foreground objects, because the trajectories can still
be predicted in the hidden space. We build a likelihood map of the
occluded space as follows:

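$$\psi(\mathbf{x}) = \frac{\sum_{j=1}^{J}\sum_{i=1}^{F} \exp\!\big(-\|\mathbf{x} - \mathbf{B}_i\boldsymbol{\beta}_j\|^2 / 2\sigma^2\big)\,\phi(\mathbf{B}_i\boldsymbol{\beta}_j)}{\sum_{j=1}^{J}\sum_{i=1}^{F} \exp\!\big(-\|\mathbf{x} - \mathbf{B}_i\boldsymbol{\beta}_j\|^2 / 2\sigma^2\big)}, \qquad (5)$$

where $\psi(\mathbf{x})$ is the likelihood of the occluded space that a
trajectory can pass through at the evaluating point
$\mathbf{x} \in \mathbb{R}^2$ on the ground. $\boldsymbol{\beta}_j$ is
the $j$-th predicted trajectory, $J$ is the number of predicted
trajectories, and $\sigma$ is the bandwidth of the Gaussian kernel.
Equation (5) takes into account the likelihood of the predicted
trajectories weighted by the likelihood of the occlusion.
$\psi(\mathbf{x})$ is high when many trajectories are predicted at
$\mathbf{x}$ while $\phi(\mathbf{x})$ is high.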
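A kernel-weighted likelihood of this kind can be sketched as below. This is one possible reading, stated as an assumption: the occlusion value is evaluated at each predicted trajectory point and averaged with Gaussian weights centered on the query location. The function name, the array layout, and the small `eps` guard against division by zero are all choices made for this sketch.

```python
import numpy as np

def occluded_space_likelihood(x, trajectory_points, phi, sigma=0.5, eps=1e-12):
    """Gaussian-weighted average of occlusion values near a ground location x.

    x:                 query point on the ground, shape (2,).
    trajectory_points: (J, F, 2) points of the J predicted trajectories.
    phi:               callable mapping (P, 2) ground points to (P,) occlusion values.
    """
    pts = np.asarray(trajectory_points, dtype=float).reshape(-1, 2)
    sq_dist = np.sum((pts - np.asarray(x, dtype=float)) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # eps keeps the ratio finite when x is far from every predicted trajectory.
    return float(np.sum(weights * phi(pts)) / (np.sum(weights) + eps))
```

Thresholding this value over a grid of ground locations gives candidate regions of occluded but walkable space.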

5. EgoMotion Dataset 

We present a new dataset, EgoMotion dataset, captured 
by first person stereo cameras. This dataset includes various 
indoor and outdoor scenes such as Park, Malls, and Campus 
with various activities such as walking, shopping, and social 
interactions. 

5.1. Data Collection

A stereo pair of GoPro Hero 3 (Black Edition) cameras with a 100 mm
baseline is used to capture the EgoMotion dataset, as shown in
Figure 2(a). All videos are recorded at 1280×960 at 100 fps. The stereo
cameras are calibrated prior to the data collection and synchronized
manually with a synchronization token at the beginning of each sequence.

Depth Computation We compute disparity between the stereo pair after
stereo rectification. A cost space of stereo matching is generated for
each scan line, and each pixel is matched by exploiting dynamic
programming in a coarse-to-fine manner.

3D Reconstruction of Ego-motion We reconstruct a camera trajectory using
a standard structure from motion pipeline with a few modifications to
handle a large number of images². We partition the dataset such that
each subset includes fewer than 500 images with sufficient overlap with
neighboring image sets (100 image overlap). We reconstruct each subset
independently and merge them by minimizing the cross reprojection error
between two subsets, i.e., a point in one subset is reprojected to a
camera in the other subset. Then, we project the reconstructed camera
trajectory onto the ground plane estimated by fitting a plane using
RANSAC [8].

Scenes We collect both indoor and outdoor data, consisting of 21 scenes
with 55,933 frames, 7.7 hours long in total, including walking on
campus, in parks and downtown streets, shopping in the mall, cafe, and
grocery, as well as taking public transportation. The data covers
various activities (walking, talking, and shopping), scenes (campus,
park, malls, and downtown streets), cities, and times. We also collect
repeated daily routines multiple times on a campus. The dataset is
summarized in Table 1.

² A 30 minute walking sequence at a 30 fps reconstruction rate produces
108,000 HD images.

Scene           Frames            Duration
IKEA            966               08:03
Costco          577               04:49
Mall            2683              22:22
Park            3088              25:44
School1/2       3754/3736         31:17/31:08
Downtown1/2     2856/3405         23:48/28:23
Grocery1/2/3    2858/2892/2834    23:49/24:06/23:37
Bus1/2          2292/1850         19:06/15:25
Campus1         2607              21:44
Campus2         1884              15:42
Campus3         1975              16:28
Campus4         2359              19:40
Campus5         3337              27:49
Campus6         4034              33:37
Campus7         2568              21:24
Campus8         3378              28:09

Table 1. EgoMotion dataset

5.2. Data Analysis

We define the EgoSpace map with respect to a gaze direction, which
allows us to canonicalize all trajectories in one coordinate system and
further to represent them with compact bases. This stems from a primary
conjecture: a gaze direction is aligned with ego-motion. In this
section, we empirically validate the conjecture on our EgoMotion
dataset.

Figure 4. Panels: (a) Attention, (b) Yaw distribution (axes: direction
of destination, in radians). From our dataset, we empirically show that
the gaze direction is highly correlated with the direction of
destination, i.e., we look where we go.

We compute the pitch angle of a gaze direction by calibrating the
relationship between the first person camera and the gaze direction. The
pitch angle is $\cos^{-1} v_z$ by the definition in Section 3.1, and the
position after 10 seconds is used to measure the direction of
destination. Figure 4(a) shows a distribution of the direction of
destination with respect to the gaze pitch angle, which indicates that
the gaze direction is aligned with the pitch axis. Figure 4(b) shows a
yaw distribution of the direction of destination given a pitch angle (a
horizontal cross section of Figure 4(a)). This also indicates that the
gaze direction is highly correlated with the direction of destination.
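The yaw component of the direction of destination can be read directly from a gaze-normalized trajectory. The sketch below assumes a trajectory given as (x, z) pairs in the egocentric frame and sampled at a known rate; the function name and the sign convention (0 means straight along the gaze) are assumptions.

```python
import math

def destination_yaw(trajectory_xz, fps, horizon_s=10.0):
    """Yaw angle (radians) of the position reached after horizon_s seconds.

    trajectory_xz: sequence of (x, z) ground positions in the egocentric frame,
                   where index 0 is the current position and Z is the gaze.
    Returns 0 when the destination lies straight along the gaze direction.
    """
    index = int(round(horizon_s * fps))
    x, z = trajectory_xz[index]
    return math.atan2(x, z)
```

A histogram of these angles over all frames gives a yaw distribution of the kind summarized in Figure 4(b).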


6. Result 


We apply our method to predict ego-motion and hidden space in real world
scenes by leveraging the EgoMotion dataset. We divide all scenes into
two categories, indoor and outdoor, as ego-motion has different
characteristics in each, e.g., speed and scene layout. Note that for all
evaluations, we predict a scene that is not included in the training
data, i.e., training and testing scenes are completely separated.


6.1. Quantitative Evaluation 

We quantitatively evaluate our trajectory prediction by comparing with
ground truth trajectories obtained by 3D reconstruction of the first
person camera. Our evaluation addresses the future localization problem.

Multiple trajectories are often equally plausible, e.g., at a
Y-junction, while only one ground truth trajectory is available per
image. This results in a large prediction error. To address this
multiple path configuration, we measure predictive precision, i.e., how
often one of our predicted trajectories aligns with the ground truth
trajectory: $\mathrm{prec.} = \frac{1}{N}\sum_{i=1}^{N} V_i$, where $N$
is the number of testing images. $V_i = 1$ if
$\min_k \max_t \|\mathbf{X}_t - \mathbf{X}_t^k\| < \epsilon$, and
$V_i = 0$ otherwise, where $\mathbf{X}_t^k$ is the location at the
$t$-th time instant of the $k$-th predicted trajectory and $\mathbf{X}$
is the ground truth trajectory. We set $\epsilon = 1.5$ m. Note that
unlike previous approaches that measured a spatial distance between
trajectories³, our evaluation measures a spatiotemporal distance between
trajectories because the time scale also needs to be considered.
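The precision measure can be written compactly with array operations. This sketch assumes all predicted and ground truth trajectories are sampled at the same time instants; the function name and array layout are illustrative.

```python
import numpy as np

def predictive_precision(predicted, ground_truth, eps=1.5):
    """Fraction of test images with at least one prediction within eps meters.

    predicted:    (N, K, T, 2) K predicted trajectories per test image.
    ground_truth: (N, T, 2) one ground truth trajectory per test image.
    A prediction counts only if it stays within eps at every time instant,
    which makes the distance spatiotemporal rather than purely spatial.
    """
    err = np.linalg.norm(predicted - ground_truth[:, None], axis=-1)  # (N, K, T)
    worst_over_time = err.max(axis=2)           # max over t
    best_prediction = worst_over_time.min(axis=1)  # min over k
    return float(np.mean(best_prediction < eps))
```

Taking the maximum over time before the minimum over predictions means a trajectory that reaches the right place at the wrong time is not counted as correct.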

Four baseline methods⁴ are compared with our approach: one method based
solely on gaze direction, two methods with a subsampled depth image at
the same resolution as our EgoSpace map, and one method with the
EgoSpace map but without trajectory refinement by Equation (4). (1)
Going straight: we generate a trajectory aligned with the gaze direction
to test gaze bias; (2) Pure 2D: we retrieve a set of trajectories using
KNN based solely on a subsampled depth image; (3) 2D+ground plane: we
retrieve trajectories using the subsampled depth image but transform the
coordinates of the trajectories such that they lie on the ground plane
of the test image. This coordinate transform takes into account the 3D
camera direction with respect to the ground plane of the test image; (4)
EgoSpace w/o trajectory optimization: the trajectories are retrieved by
the EgoSpace map but are not adapted to the test image by Equation (4).
In fact, this method provides the initialization of our predicted
trajectories.

³ Dynamic time warping was used to handle the time scale.

⁴ These baseline algorithms were designed by us because no previous
algorithm exists to predict the trajectories of ego-motion.

Figure 5 shows evaluations on indoor and outdoor depth images. We
retrieve k neighbors from the dataset and measure precision. Our method
outperforms the baseline algorithms by a large margin. These experiments
indicate that the EgoSpace representation has strong predictive power
compared to the camera pose oriented feature produced by the subsampled
depth image. In addition, the scene adaptation by the trajectory
optimization allows us to produce more accurate predictions (see the
performance gap from the initialization). As noted in Section 5, a gaze
direction is a good predictor, but it is not strong enough to predict
long-term behavior. Note that the precision at small k may be
significantly improved by using N-best algorithms based on homotopy
classes, because KNN retrieves many redundant trajectories. In Table 2,
we measure the average precision across all scenes in Section 5.


Indoor               0~5 secs               5~10 secs              10~15 secs
                     k=100  k=60   k=30     k=100  k=60   k=30     k=100  k=60   k=30
Going straight       0.571                  0.221                  0.124
Pure 2D              0.643  0.507  0.308    0.524  0.379  0.217    0.346  0.229  0.123
2D+Ground plane      0.710  0.556  0.367    0.561  0.413  0.267    0.384  0.261  0.162
EgoSpace w/o opt.    0.690  0.534  0.341    0.570  0.265  0.255    0.401  0.265  0.156
EgoSpace w/ opt.     0.825  0.687  0.458    0.693  0.543  0.331    0.482  0.347  0.192

Outdoor              0~5 secs               5~10 secs              10~15 secs
                     k=100  k=60   k=30     k=100  k=60   k=30     k=100  k=60   k=30
Going straight       0.443                  0.259                  0.103
Pure 2D              0.535  0.506  0.303    0.417  0.391  0.218    0.267  0.255  0.142
2D+Ground plane      0.554  0.554  0.350    0.425  0.407  0.244    0.293  0.261  0.135
EgoSpace w/o opt.    0.567  0.527  0.329    0.432  0.399  0.233    0.289  0.250  0.141
EgoSpace w/ opt.     0.683  0.666  0.441    0.538  0.522  0.298    0.373  0.355  0.171

Table 2. Average precision (k is the number of neighbors)


Occluded Space Discovery We quantitatively evaluate our occluded space
discovery by measuring the detection rate, D/N, where D is the number of
true positive detections and N is the total number of detections
produced by the space discovery. We threshold the likelihood of the
occluded space, $\psi$, from Equation (5) and manually evaluate whether
each detection is correct. Note that no ground truth label is available
unless the camera wearer had already passed through the space. The
detection rate in Table 3 indicates that our method predicts the outdoor
scenes better than the indoor scenes. This is because, in indoor scenes
such as Grocery and IKEA, the camera wearer had a number of close
interactions with objects such as shelves or products, where the view of
the scene is substantially limited.

Figure 5. Panels: ground truth ego-motion; input depth image; precision
at 0~5 secs, 5~10 secs, and 10~15 secs versus the number of predicted
trajectories, k. We compare our method with four baseline
representations: (1) Going straight; (2) Pure 2D: no EgoSpace map,
without adaptation of the ground plane to the test scene; (3) 2D +
ground plane: no EgoSpace map, with adaptation of the ground plane to
the test scene; (4) EgoSpace without trajectory optimization. Our method
outperforms the other representations.


Indoor            Main      Grocery   IKEA
Detection rate    0.5882    0.2371    0.3937

Outdoor           Park      Bus stop  Walk
Detection rate    0.6234    0.6593    0.6338

Table 3. Detection rate


6.2. Qualitative Evaluation 

We apply our method to real world examples to predict a set of plausible
trajectories of ego-motion and the space occluded by foreground objects.
Our training dataset is completely separated from the testing data,
e.g., the model trained on the Grocery scene is used to predict the IKEA
scene. Given a depth image, we estimate the ground plane by RANSAC-based
plane fitting with gravity and height priors. This ground plane is used
to define the EgoSpace map with respect to the camera direction⁵.
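A basic RANSAC plane fit of the kind referred to above can be sketched as follows. This version omits the gravity and height priors mentioned in the text, and the function name, iteration count, and inlier tolerance are assumptions chosen for illustration.

```python
import numpy as np

def fit_ground_plane_ransac(points, n_iters=200, inlier_tol=0.05, seed=0):
    """Fit a plane n.p + d = 0 to (N, 3) points by random sampling.

    Returns (normal, d) for the candidate plane with the most inliers,
    where an inlier lies within inlier_tol of the plane.
    """
    rng = np.random.default_rng(seed)
    best_count, best_plane = -1, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # the three samples are (nearly) collinear
        normal = normal / norm
        d = -float(normal @ p0)
        count = int(np.sum(np.abs(points @ normal + d) < inlier_tol))
        if count > best_count:
            best_count, best_plane = count, (normal, d)
    return best_plane
```

A prior on gravity could be added by rejecting candidate normals that deviate too far from an expected up direction, and a height prior by rejecting planes whose distance to the camera is implausible.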

Figure 1 and Figure 7 illustrate our results on the EgoMotion dataset.
In the testing phase, only a depth image is used, while 3D
reconstruction of camera poses is used in the training phase. In
Figure 7, we show (1) the image and ground truth ego-motion; (2) the
input depth image; (3) the EgoSpace map overlaid with the predicted
trajectories (gray) and ground truth trajectory (red); (4) reprojection
of the trajectories; (5) reprojection of the occluded space computed by
the EgoSpace map (inset image). For all scenes, our method predicts
plausible trajectories that pass through unexplored space.

Obstacle Avoidance Our cost function in Equation (4) minimizes the cost
difference between trajectories from the training data and the testing
data. This precludes a trajectory from passing through an object unless
the retrieved trajectory was partially occluded. The EgoSpace map
captures obstacle avoidance, as shown in Campus and Grocery.

⁵ The yaw angle of the gaze direction is assumed to be aligned with the
camera direction.


Multiple Plausible Trajectories Our prediction produces a number of
plausible trajectories that conform to the testing scene: trifurcated
trajectories in Campus, bifurcated trajectories in Bus stop, and
multiple directions of trajectories in Mall 1.

Occluded Space Discovery The space occluded by foreground objects is
discovered by the predicted trajectories. Examples include the space
inside the shop and behind the person in Figure 1; the space occluded by
the left fence and persons in Campus; the space behind the cars and the
parking vending machine in Bus stop; the space behind the persons and
trees in Park; the space inside the shop and around the left corner; the
space behind the column; and the space occluded by the fence.

7. Discussion 

In this paper, we present a method to predict ego-motion and the space
occluded by foreground objects from a first person depth image. An
EgoSpace map that encodes a likelihood of occlusion is used to represent
the scene around a camera wearer. We associate a trajectory with the
EgoSpace map in the training phase to predict a set of plausible
trajectories given a test depth image. The trajectories, which are
parametrized by a linear combination of compact trajectory bases, are
refined to conform to the test depth image. The occluded space is
detected by measuring how often the predicted trajectories enter the
occluded space.


Limitation Our framework needs three ingredients: training data from
similar scenes, ground plane estimation, and depth computation. Failure
cases for each of these are illustrated in Figure 6.

Figure 6. Our method fails due to misestimation of the ground plane,
different scene distributions, and failure of depth estimation.














































































Figure 7. Columns: ground truth ego-motion; input depth image; EgoSpace
map; output (predicted ego-motion and discovered hidden space). Given a
depth image (the second column), we predict a set of plausible
trajectories of ego-motion (the fourth column) and discover the occluded
space (the fifth column) using the EgoSpace map (the third column:
predicted trajectories (gray) and ground truth trajectory (red)). The
first column shows an image with the ground truth trajectory of
ego-motion measured by 3D reconstruction of a first person camera (time
is color-coded). For more scene description, see Section 6.